Multi-task Learning For Detecting and Segmenting Manipulated Facial Images and Videos
Detecting manipulated images and videos is an important topic in digital
media forensics. Most detection methods use binary classification to determine
the probability of a query being manipulated. Another important topic is
locating manipulated regions (i.e., performing segmentation), which are mostly
created by three commonly used attacks: removal, copy-move, and splicing. We
have designed a convolutional neural network that uses the multi-task learning
approach to simultaneously detect manipulated images and videos and locate the
manipulated regions for each query. Information gained by performing one task
is shared with the other task and thereby enhances the performance of both
tasks. A semi-supervised learning approach is used to improve the network's
generalizability. The network includes an encoder and a Y-shaped decoder.
Activation of the encoded features is used for the binary classification. The
output of one branch of the decoder is used for segmenting the manipulated
regions while that of the other branch is used for reconstructing the input,
which helps improve overall performance. Experiments using the FaceForensics
and FaceForensics++ databases demonstrated the network's effectiveness against
facial reenactment attacks and face swapping attacks as well as its ability to
deal with the mismatch condition for previously seen attacks. Moreover,
fine-tuning using just a small amount of data enables the network to deal with
unseen attacks.
Comment: Accepted to be published in Proceedings of the IEEE International
Conference on Biometrics: Theory, Applications and Systems (BTAS) 2019,
Florida, US
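The abstract above describes three outputs trained together: a binary classification, a segmentation mask, and a reconstruction of the input. One plausible way to combine them into a single training objective is a weighted sum of three losses. The sketch below is illustrative only: the choice of binary cross-entropy and mean squared error, the equal default weights, and all function names are assumptions and are not taken from the paper.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy between a predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_bce(pred_mask, true_mask):
    """Average per-pixel cross-entropy for a flattened segmentation mask."""
    pairs = list(zip(pred_mask, true_mask))
    return sum(binary_cross_entropy(p, y) for p, y in pairs) / len(pairs)


def mean_squared_error(recon, original):
    """Average squared difference between reconstruction and input."""
    return sum((r - o) ** 2 for r, o in zip(recon, original)) / len(original)


def multitask_loss(cls_prob, cls_label, pred_mask, true_mask,
                   recon, original, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of classification, segmentation, and reconstruction losses.

    The weights are hypothetical hyperparameters, not values from the paper.
    """
    w_cls, w_seg, w_rec = weights
    l_cls = binary_cross_entropy(cls_prob, cls_label)
    l_seg = mean_bce(pred_mask, true_mask)
    l_rec = mean_squared_error(recon, original)
    return w_cls * l_cls + w_seg * l_seg + w_rec * l_rec


# Toy example: a manipulated two-pixel "image" (label 1).
good = multitask_loss(0.99, 1, [0.99, 0.01], [1, 0], [0.5, 0.2], [0.5, 0.2])
bad = multitask_loss(0.01, 1, [0.01, 0.99], [1, 0], [0.0, 0.0], [0.5, 0.2])
print(good < bad)
```

Because all three terms are computed from the same encoded features, a gradient step on the summed loss updates the shared encoder with information from every task, which is the sharing effect the abstract refers to.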
Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama’s voice using GAN, WaveNet and low-quality found data
Thanks to the growing availability of spoofing databases and rapid advances
in using them, systems for detecting voice spoofing attacks are becoming more
and more capable, and error rates close to zero are being reached for the
ASVspoof2015 database. However, speech synthesis and voice conversion paradigms
that are not considered in the ASVspoof2015 database are appearing. Such
examples include direct waveform modelling and generative adversarial networks.
We also need to investigate the feasibility of training spoofing systems using
only low-quality found data. For that purpose, we developed a generative
adversarial network-based speech enhancement system that improves the quality
of speech data found in publicly available sources. Using the enhanced data, we
trained state-of-the-art text-to-speech and voice conversion models and
evaluated them in terms of perceptual speech quality and speaker similarity.
The results show that the enhancement models significantly improved the SNR of
low-quality degraded data found in publicly available sources and that they
significantly improved the perceptual cleanliness of the source speech without
significantly degrading the naturalness of the voice. However, the results also
show limitations when generating speech with the low-quality found data.
Comment: conference manuscript submitted to Speaker Odyssey 201
Audiovisual Speaker Conversion: Jointly and Simultaneously Transforming Facial Expression and Acoustic Characteristics
An audiovisual speaker conversion method is presented for simultaneously
transforming the facial expressions and voice of a source speaker into those of
a target speaker. Transforming the facial and acoustic features together makes
it possible for the converted voice and facial expressions to be highly
correlated and for the generated target speaker to appear and sound natural. The
method uses three neural networks: a conversion network that fuses and transforms the
facial and acoustic features, a waveform generation network that produces the
waveform from both the converted facial and acoustic features, and an image
reconstruction network that outputs an RGB facial image, also based on both sets of
converted features. The results of experiments using an emotional audiovisual
database showed that the proposed method achieved significantly higher
naturalness compared with one that separately transformed acoustic and facial
features.
Comment: Submitted to ICASSP 201
High-Quality Nonparallel Voice Conversion Based On Cycle-Consistent Adversarial Network
Although voice conversion (VC) algorithms have achieved remarkable success
along with the development of machine learning, superior performance is still
difficult to achieve when using nonparallel data. In this paper, we propose
using a cycle-consistent adversarial network (CycleGAN) for nonparallel
data-based VC training. A CycleGAN is a generative adversarial network (GAN)
originally developed for unpaired image-to-image translation. A subjective
evaluation of inter-gender conversion demonstrated that the proposed method
significantly outperformed a method based on the Merlin open source neural
network speech synthesis system (a parallel VC system adapted for our setup)
and a GAN-based parallel VC system. This is the first research to show that the
performance of a nonparallel VC method can exceed that of state-of-the-art
parallel VC methods.
Comment: accepted at ICASSP 201
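The property that lets a CycleGAN train on nonparallel data is cycle consistency: converting a source utterance to the target speaker and back should return the original features, so no aligned source/target pairs are needed. The following is a minimal sketch of that loss term only, using toy feature vectors and hand-written mappings in place of neural generators. The function names and the toy mappings are hypothetical, and the adversarial loss that a full CycleGAN also uses is omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(g_xy, g_yx, source_batch, target_batch):
    """L1 cycle loss: x -> G_xy -> G_yx should recover x, and likewise for y.

    source_batch and target_batch are unpaired lists of feature vectors.
    """
    forward = sum(l1_distance(g_yx(g_xy(x)), x)
                  for x in source_batch) / len(source_batch)
    backward = sum(l1_distance(g_xy(g_yx(y)), y)
                   for y in target_batch) / len(target_batch)
    return forward + backward


# Toy stand-ins for the two generators (assumed for illustration only).
def to_target(v):
    return [2.0 * e + 1.0 for e in v]


def to_source(v):
    return [(e - 1.0) / 2.0 for e in v]  # exact inverse of to_target


def bad_to_source(v):
    return [e / 2.0 for e in v]  # not an inverse, so the cycle breaks


sources = [[0.0, 1.0]]
targets = [[1.0, 3.0]]
print(cycle_consistency_loss(to_target, to_source, sources, targets))
print(cycle_consistency_loss(to_target, bad_to_source, sources, targets))
```

The first call uses mutually inverse mappings and yields zero loss, while the second is penalized because the round trip does not return the input.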